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This work builds a unified framework for the study of quadratic 
form distance measures as they are used in assessing the goodness of 
fit of models. Many important procedures have this structure, but the 
theory for these methods is dispersed and incomplete. Central to the 
statistical analysis of these distances is the spectral decomposition of 
the kernel that generates the distance. We show how this determines 
the limiting distribution of natural goodness-of-fit tests. Additionally, 
we develop a new notion, the spectral degrees of freedom of the test, 
based on this decomposition. The degrees of freedom are easy to 
compute and estimate, and can be used as a guide in the construction 
of useful procedures in this class. 

1. Introduction. Modern scientific work has presented statistics witli 
many important challenges, but of particular importance are the challenges 
presented by "large magnitude," both in the dimension of data vectors and 
in the number of vectors [see Lindsay, Kettenring and Siegmund (2004)]. 
Assessment of the fit of a model in such a situation can be challenging. 

Model fit assessment is usually based, either explicitly or implicitly, on 
measures of distance d{F, G) between probability measures F and G. Our 
foundation stones will be quadratic distance measures. This class is charac- 
terized by the simple quadratic form structure 

dK{F,G)= J I KG{s,t)d{F-G)is)d{F-G)it), 
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that is adaptable through the choice of a nonnegative definite kernel Kg{s, t). 
This form is asymmetric in F and G; here G will often be a distribution 
whose goodness of fit we wish to assess, and F will often be a nonparametric 
estimate F of the true sampling distribution Ft-. 

There are a number of important reasons why quadratic distances are 
central to the study of goodness of fit. These will be discussed in Section 2. 
A central goal here is to fill in some major gaps in the theory of quadratic 
distances. One major new result of this paper is the derivation of the lim- 
iting distribution for quadratic distances when used as goodness-of-fit tests 
in parametric models. These results depend on an appropriate spectral de- 
composition of the kernel K. We derive several new examples of such de- 
compositions. 

However, in many potential applications the numerical difficulty of de- 
termining the full spectral decomposition would make use of the limiting 
theory impractical. Our second set of major new results concerns the role 
of spectral degrees of freedom (sDOF), a concept introduced in this paper. 
We show that the limiting distributions involved are well approximated by 
chi-squared distributions when the degrees of freedom are large. Moreover, 
the sDOF are easily estimated empirically. 

For kernel smoothing-based L2 distances this is especially important be- 
cause degrees of freedom are a more natural measure of the operating char- 
acteristics of the quadratic distance than are the bandwidth parameters. 
The literature on quadratic distances contains virtually no discussion of a 
concept we find critical. That is, in multivariate goodness of fit it is impor- 
tant to construct tuneable distances so that one can adjust the operating 
characteristics of the procedure to the dimension of the sample space and 
the sample size, much as one would do in a chi-squared analysis. 

1.1. The formal setup. Let 5 be a sample space, with measurable sets i3, 
and let du{s) be the canonical "uniform" measure on this space. The building 
block for our distance will be K{s,t), a bounded, symmetric kernel function 
on S X S. In analogy with matrix theory, a kernel is called nonnegative 
definite (NND), if the quadratic form JJ K{s,t) da{s) da{t) is nonnegative 
for all bounded signed measures a, and it is conditionally NND (i.e., CNND) 
if nonnegativity holds for all a satisfying the condition / da{s) = 0. 

Although our theoretical developments will be given for abstract spaces 
S, it is important that for data calculations we will use discrete spaces. If 
a is finite discrete, with masses at si, . . . , Sm, then the CNND requirements 
reduce to the conditional nonnegative definiteness of the matrix IC having 
i,j element K{si,Sj). 

Definition 1. Given a CNND Kcis^t), possibly depending on G, the 
J^-based quadratic distance between two probability measures F and G is 
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defined as 

(1.1) dK{F,G)= J I KG{s,t)d{F-G){s)d{F-G){t). 

Note that the distance is weh defined even when F and G do not have 
densities with respect to a common measure. The calculation for dK{F,G) 
can be written in the form 

dxiF, G) = K{F, F) - K{F, G) - K{G, F) + K{G, G), 

where we have used the shortcut notation K{A,B) = JJ K{s,t) dA{s) dB{t). 
We will call dxiF^G) the empirical distance between the data and the G. 

The discrete/matrix version of the problem will be of considerable statis- 
tical interest for its use in estimation. Let F^ be the true sampling distri- 
bution and F the empirical distribution of a sample xi, . . . , Xn from Fr- Let 
be the nx n empirical representation of the kernel Kq, having ijth ele- 
ment Kcixi^Xj). In this case a quantity such as // Kcix^y) dF{x) dF{y) = 

i^K^l/v? estimates // Kcix^y) dFr{x) dFr{y). 

A possible practical limitation of quadratic distances is that numerical cal- 
culation of the distance requires twofold integration over the sample space. 
If the integrals are not explicit, one approach would be to perform Monte 
Carlo integration to calculate the distance, which in turn requires a sim- 
ulation algorithm for the distributions involved. However, it is sometimes 
possible to choose a model-specific kernel that makes the distance calcula- 
tion explicit and fast. This in turn enables one to construct test procedures 
that rely on other computationally intensive devices like bootstrapping. 

2. The central role of quadratic distance. In this section we offer rea- 
sons that quadratic distance-based methods are central to goodness-of-fit 
inference. 

2.1. Important quadratic distances. A number of classically important 
distances, such as Pearson's chi square or Cramer-von Mises, are quadratic 
distances. Other more recent examples can be found in Fan (1997, 1998), 
Fan, Zhang and Zhang (2001) and Zuo and He (2006). 

L2 distances. In 2? = {1,2, . . . , N} 01 N = {0, 1,2,...} one can use the 
"identity kernel" K{s^ t) = I[s = t] and get the ordinary L2 distance "^{fii) — 
g{i))'^ . However, in TZ using the identity kernel gives the integral 

yy I[x = y\{f{x) - g{x)){f{y) - g{y)) dxdy, 

which is identically zero. We will later show how to construct kernels that 
approximate the identity kernel, and hence the L2 distance. 
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Similarly, the Pearson kernel 
(2.1) Ka(s.t)- "^ = '1 



which nominally gives the Pearson distance /(/(s) — g{s))'^ / g{s) du{s) be- 
tween two densities, can be used in V or AA, but cannot be used in TZ. 
This distance will be important to our story, as it is the quadratic distance 
approximant of Kullback-Leibler distance, as will be shown. 

An unconventional example. A kernel that is quite popular as a smooth- 
ing kernel is the normal kernel with smoothing parameter h [Silverman 
(1986)]. The special computational utility of this kernel derives from the 
convolution identity 

(2.2) -^/if+ftl {x, y)= j K^2 (x, z)K^2^ (z, y) dz. 

The identity implies if we use the normal kernel Ki^2 together with the 
normal model G = K^2{x, ^) we obtain an explicit, no-integration-needed 
formula for the empirical distance given as dK{F,G) = Kf^2{F,F) — 
2n~^ J2i Kf^2_^„2 {xi, n) + Kj^2j^2ct'^ (/^i A*)- This same computational facility car- 
ries over if G is a finite mixture of normals. 

2.2. Relationship to L2. For a given symmetric kernel K{s,t), there ex- 
ists a symmetric kernel K^^"^ satisfying the relationship 

K^^^{s, r)K^/^{r, t) du{r) = K{s, t). 

(Existence will follow from the spectral decomposition that follows later.) 
For the normal kernel, (2.2) shows that i^/i2/2 is the square root kernel of 
the normal kernel Kfi2 . 

The square root operation leads us to a natural interpretation of the 
quadratic distance as an L2 distance between smoothed densities. 

Proposition 1. Let K{s,t) be a symmetric, nonnegative definite ker- 
nel. Then 

dK{F,G)= [{r{z)-g*{z)fdz, 



where f* {z) = J K^'^ {z, r) dF{r) and g* {z) = J K^/^ {z, r) dG{r) 
Proof. By reversal of order of integration. □ 



From the above proposition we see that if we use F instead of the em- 
pirical distance dK{F,G) represents the L2{dz) distance between the kernel 
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density estimator f*{z) and the smoothed G distribution. Moreover, note 
that for the normal kernel the above relationship implies that the kernel is 
positive definite. 

Conversely, any kernel smoothing problem can be turned into a quadratic 
distance problem. If, for example, kh{x — y) is a smoothing kernel on TZ that 
is used to construct the density estimator, the corresponding smoothed L2 
distance arises from the nonnegative definite kernel: 

.(...^/..(.-*.-.).. 

This formula provides a simple way to generate CNND kernels from other 
kernels k. 



2.3. von Mises expansions. We illustrate now that every smooth distance 
measure can be approximated, in a local sense, by a quadratic distance. To 
do this, we use the idea of von Mises expansion. 

Consider the Kullback-Leibler distance d{F,G) = E«=o /(^) ln[/(i)/c/(i)] 
defined on M. The infiuence function is 

00 

T^o(.)=inr(.)M.)-^r(i)in[r(.)MO]. 

Notice that the influence function is identically zero if "the null is true." 
Moreover, for the Kullback-Leibler distance, the Hessian is 

Thus, when the expansion point is f° = g, the quadratic approximation to 
Kullback-Leibler is the Pearson chi-squared distance: 

g/(.)ln|/(.)/,(.)|.£ I^W-f« 

i=0 i=0 ^^''> 

3. The decomposition theorem. We now turn to discuss briefly the im- 
portant role of spectral theory in determining the limiting distribution of 
the empirical quadratic distance dx{F,G) between the data-based empiri- 
cal distribution F and a hypothetical model G. 

3.1. Functional spectral decomposition. Let K{x,y) be a real-valued 
QS-measurable positive deflnite kernel function on a measure space (5, *B, M) . 
The functional spectral decomposition of a kernel is similar to the spectral 
decomposition of a matrix with one very important exception: the functional 
spectral decomposition depends on the underlying measure M. In our usage, 
the distribution M will usually be F^-, the true distribution of the data. If 
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we are calculating the decomposition under the null hypothesis, M will be 
G for the simple hypothesis Hq : F.,- = G, and for composite null hypotheses 
Hq'.Ft G {Gg}, M will be one element of the parametric family of distri- 
butions. We will call M the baseline measure and require that the kernel 
satisfies 

(3.1) / I K{x,yfdM{x)dM{y) <oo. 
Js Js 

This will hold for many typical examples because M is a probability mea- 
sure and K is bounded. Such a kernel K{x,y) generates a Hilbert-Schmidt 
operator on L2(M) through the operation {Kg){x) = J K{x,y)g{y) dM{y). 
Our treatment here largely follows Yosida (1980). 

Theorem 3.1. A nonnegative definite kernel K satisfying (3.1) can he 
written as 

oo 

(3.2) K{x,y) = Y.Xj<Pj{x)(l)j{y), 

i=i 

where Aj's and (j)j{xy s are eigenvalues and corresponding normalized eigen- 
vectors of K under baseline measure M . The series in (3.2) converges strongly 
to K; that is, for every g in L2, 




Moreover, Xj > since K is NND. 

If K{x,y) is real-valued and symmetric, then K \s a. self-adjoint operator. 
The decomposition of K{x,y) given in (3.2) corresponds to the spectral 
decomposition for a compact, self-adjoint operator, and will be called the 
{K, M) spectral decomposition. 

If M equals the empirical measure F, then m{xi) = 1/n, the uniform 
density. Let IC be the nx n empirical matrix with ijth element K{Xi,Xj). 
It is then clear that the {K,F) eigendecomposition is just the same as the 

matrix eigendecomposition of the empirical kernel K except that eigenvector 
normalization is changed from ||</>|p = 1 to 

J 4>'^{x)dF{x)=n~^\\4>\\^ = 1. 
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3.2. Spectral trace of a kernel. Fortunately, the most important attributes 
of the spectral decomposition can be calculated (or estimated) without ob- 
taining the full decomposition. First, a consequence of the spectral decom- 
position theorem is that we can calculate the sum of squared eigenvalues by 
integration: 

f^X] = J J K{x,yfdM{x)dM{y) <oo. 

We will denote the above quantity by traceM(-^^^) due to its relationship to 
the matrix trace calculation. 

The quantity traceM(-f^^) is easily estimated from data. Suppose for mea- 
sure M we use the true distribution F^- , and that Xi , . . . , Xn is a sample 
from Ft-. Then tvacep^^K'^) is estimated consistently by trace^(iC^), which 

equals tr(]fC^)/n^, where in the last expression we have used tr to denote the 
standard matrix trace operation. 

Many kernel functions satisfy even a stronger condition in that X^j^i -^j ) 
which we will write as traceM(-^)) is finite. The operators defined by those 
kernels are called nuclear. Under mild continuity assumptions one can also 
calculate tracejv/l-^) without decomposition. 

Lemma 1. Let K{x,y) be a NND Hilbert-Schmidt kernel and let Aj, 
i = 1, 2, 3, . . . , denote the eigenvalues of the corresponding operator. Suppose 
that K{x,y) is continuous at {x,x) for almost all x with respect to the mea- 
sure M. Then, a necessary and sufficient condition for J2'j^i <oo is that 
jK{x,x)dM{x) converges. Moreover, i/ < oo, then 

oo „ 

^ Aj = / K{x,x) dM{x) = tra,ceM{K). 
i=i 

For the proof of this lemma see Yang (2004). 

Once again, tiaceF^{K) admits a simple consistent estimator, namely 

tr(K)/n. These empirical estimators of trace quantities will be important 
later, as they enable one to approximate the limiting distributions of the 
test statistics through degrees-of-freedom calculations. 

3.3. An interpretation; plus centering. Kernels and their representations 
are heavily used in support vector machines, where the eigenfunctions repre- 
sent the "features" of importance in the problem, and the eigenvalues repre- 
sent the weight attached to those features [Hastie, Tibshirani and Friedman 
(2001)]. 
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Similarly, in a statistical distance, we can write 

dK{F,G)= J I K{s,t)d{F-G){s)d{F-G){t) 

so that the eigenvalues indicate the weight (importance) given the squared 
deviations in the features, which for us are the difference in expected values 
of the eigenfunctions under the two distributions. 

However, there is an important detail missing. Because both F and G 
are probability measures, it is an easy exercise to show that K{x,y) and 
K*{x, y) = K{x, y) + a(x) + a(y) + 6 both generate exactly the same quadratic 
distance dK{F,G), for any functions a{x) and scalar b. However, K and K* 
need not give the same spectral decomposition. Fortunately, statistical con- 
siderations point to a particularly natural choice for a{x) and b to use in the 
spectral decomposition. If G is a hypothetical true model, then we should 
use the spectral decomposition of the following modified kernel. 

Definition 2. The G-centered kernel K, denoted by i^con(G)) is de- 
fined as Keen{G){x,y) = K[x,y) - K{x,G) - K{G,y) + K{G,G)Mhen the 
identity of G is clear from context, we will use notation -fCcciu K{x,G) = 
J K(x,y) dG{y), and the terms K{G,y) and K{G,G) are similarly defined. 

Note, by easy calculation, that 

K,,^ix,G) = I Kccn(x,y) X IdG(y) = 0. 

That is, the centering of K has forced the function (j)iix) = 1 to be an eigen- 
function of Keen, with eigenvalue 0. As a consequence, by orthogonality to (pi , 
all the nonzero eigenfunctions have mean zero under G : j (f)k{x) dG{x) = 0. 

The centering of the kernel is similar to a two-sided projection operation. 
If we are in P, the discrete case, if g is the uniform density as in the 

case of F, and is just the projection matrix Pi that projects onto 

the space of constant vectors, then 

(3.3) ]Keen = (I-Pl)K(I-Pi). 

This "bilateral projection" formulation will later motivate the centering 
technique used when G depends on estimated parameters. 

In addition, if we wish to estimate nonparametrically the kernel after it 
has been centered by the true distribution F^-, we can empirically center the 

empirical kernel matrix K., obtaining 

(3.4) 'lK;In = (I-IPi)t(I-Pi). 

We will later use this formula to estimate the total degrees of freedom. 
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3.4. Examples of spectral decompositions. In this section we will give 
several exact spectral decompositions. 



3.4.1. Poisson kernel. In this subsection we construct a kernel by spec- 
ifying the eigenvalues and eigenfunctions directly. The sample space will 
be the interval [0, 27r) and the baseline measure dM{x) will be the uni- 
form probability density on this interval [i.e., {2it)~^ dx\. The eigenvalues 
for the kernel will have a geometrically decaying nature, (Ai, A2, A3, . . .) = 
(1, p, p, p'^,p'^,p^, p^, . . .), where < p <1, with corresponding eigenfunctions 
(1, \/2cos(x), \/2sin(a;), \/2cos(2x), \/2sin(2a;), \/2cos(3x), -v/2sin(3x), . . .). 
Written in terms of its spectral expansion, this gives the kernel 

00 

(3.5) Kp(9, ^) = 1 + J2 2p^[cos{ke) cos{k(t)) + sin(A;^) sin{k(j})]. 

k=l 

If one rewrites the cosine and sine terms in terms of complex exponential 
terms, one can use the geometric series formula to arrive at an explicit 
representation. 



Lemma 2. 



(3.6) 



'''"^^ l-2pcos{0-cP) + p^ 

where < p < 1 and Q <6,(j) < 27r. 



Although not well known in statistics, this is the univariate version of 
the famous Poisson kernel. If we fix p and (/>, then it becomes a density 
function in the variable 9, with the parameter ^ as a location parameter 
and p as a dispersion parameter. This density has been used in statistics as 
a distribution on the unit circle, where it is known as the wrapped Cauchy 
distribution, first studied by Levy (1939) and Wintner (1947). 

In physics, it is the operator that gives the solution to the physical problem 
known as the "Dirichlet problem with boundary data" [e.g., Bhatia (2003)]. 
Additionally, it is a central tool in harmonic function theory [e.g., Axler, 
Bourdon and Ramey (2001)]. 

In this paper we will focus on the univariate version (3.6). Of importance 
to us here is that this distribution has a parameter, here p, that can be used 
to tune the degrees of freedom of the distance. Clearly, one could apply this 
kernel to distributions on any finite interval [a, h) by a suitable location and 
scale change in the variables. 

It is clear, using the infinite expansion (3.5) to do calculations, that af- 
ter centering by the uniform distribution M, the Poisson kernel has the 
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decomposition 

i^cen(0,<^)=^(^,0)-l 

= J2 ^P'' [cos{ke) cos(A;0) + sin(A;^) sm{k^)] . 

k=l 

It can also be shown that the appropriate convolution of two Poisson kernels 
is still a Poisson kernel, so this kernel is in many ways the natural analogue 
of the Gaussian one when one is considering data restricted to an interval. 

3.4.2. Normal kernel. Of central importance to statistics is the spectral 
decomposition of the univariate normal kernel K)^2{x,y) when the baseline 
measure is A^(0,cr^). A natural starting point is the Hermite polynomials. 
See Thangavelu (1993) for the relationships used here. 

Definition 3. The Hermite polynomials Hn{x) are defined by the re- 
lationship 

H^{x) = {-ire^'£,e-^\ 

As candidates for the eigenfunctions we create a family of scaled and 
damped Hermite polynomials 

Hn{x;a,b) = Hniax)e-^''''/\ 

for a and b positive. These are useful because, using a classical identity for 
Hermite polynomials called Mehler^s formula, we can create "spectral-like" 
expansions of Kfi2 in which the scaled and damped Hermite polynomials 
play the role of eigenfunctions (see the Appendix for the definition of w* 
and 7* used in the following results). 

Let 7j^(a;) be 'jnix; a{w* (r)) ,b{w* (r))) , as defined in (A.l). 

Theorem 3. Under baseline measure N(0,a'^) the kernel K^2{x,y) has 
the spectral decomposition J2'^=o'^f^"'^ni^)^niy)' where P = w*{r) and 

This representation shares with the Poisson kernel geometrically declining 
eigenvalues. It also captures the damped polynomial characteristic of the 
features used in the distance. 

4. Using distances for model assessment. In this section we give some 
of the necessary theory behind testing-based model assessment. 
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4.1. Estimation of the distance. The next result gives a key property of 
the G-centered kernel. 

Proposition 2. Let F, G he two arbitrary distributions. Then the quadratic 
distance between F, G can be written as 

dK{F,G)= J I K,U^,y)dF{x)dF{y). 

This proposition shows that, for a fixed model G, the empirical distance 
dK{F,G) = Kccn(G)iF,F) ■= Vn is a F-statistic [Serfling (1980)]. It can be 
calculated in matrix form as l-^IKcenl/'i-^- One can also unbiasedly estimate 
dKiFr^G), where F^ is the true distribution, by using the corresponding 
?7-statistic: 

(4.1) ^n= . EE-^ccn(:r,,3;,). 

The fundamental distinction between Un and Vn is the inclusion of the diag- 
onal terms Kccnixi,Xi), which have the nonzero expectation traceG'(-fCcon)- 
Under the null hypothesis F^ = G, the true distance d{FT-,G) is zero, and 
EciUn) = but Eciyn) = E[Kccn{X, X)\/n, so that traceG(-fCcon) represents 
the biasing term. 

4.2. Under the null. We start with the case where we have a prespecified 
null model G that we wish to test, using as test statistic Vn = dx^F, G) or the 
unbiased distance estimator Un{G). Letting F^ denote the true distribution, 
the null hypothesis is Hq : F^ = G. Given a spectral decomposition of the 
centered kernel Keen under G, say Kccn{x,y) = ^\i(t>i{x)(t)i{y), a heuristic 
derivation of the limiting distribution of dK{F,G) is quite easy. Write 

dK{F,G) = J I Y.\Mx)ct>i{y)dF{x)dF{y) 

oo 

1=1 

where the (j)^ are averages of mean-zero, variance- 1 variables that are uncor- 
related over i. (Recall that the mean-zero property requires the use of the 
centered kernel.) The obvious conclusion is that 

nVn^X*W, A=(Ai,A2,...), 

where x*(A) = J2 ^i^f is an infinite weighted sum of independent chi-squared 
variables. This is proved in Yang (2004). 
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The corresponding distributional result for the unbiased Un is that 

\Jn{n-l) Xccn Wi 

where Xccn(-^) = ~ !)• This result was given formally for Un in Liu 

and Rao (1995), and holds under the condition that -^i = tvacedK'^) < oo, 
which is weaker than the condition J2'iZi^i < ckd needed for Vn- Note that 
the result for Vn cannot be improved upon because the distribution x*W 
does not exist if -^i = oo. 

4.3. Under composite nulls. Next, consider the case where we wish to 
evaluate a parametric model {Ge : £ f2}. We will assume that the elements 
Gg of this model all have densities gg{x) with respect to a common measure 
dfi. A natural test statistic for the validity of this model is nVn = ndxiF, G^) 
(or the corresponding debiased statistic Un) where is a consistent estimator 
of 9 under the null hypothesis Hq : G {Gg}. If this method were applied to 
Pearson's kernel, for example, one would end up with Pearson's chi-squared 
test statistic. 

The presence of the estimated parameter in Vn necessarily makes finding 
the null distribution for general kernels K more difficult, but we will show 
here that one can turn this problem into an eigendecomposition problem 
by artful centering of the kernel. Results similar to those presented here 
were derived by Fan (1998) for the special case of the weighted quadratic 
characteristic function distance. 

Suppose that p-dimensional 6 is being estimated by the maximum likeli- 
hood estimator 9. We will assume that it can be expressed as a solution to 
the set of p likelihood equations 

^u(x,;0) = O, 

where the likelihood scores u satisfy Eg[u(X; 9)] = 0. Notice that we are here 
using the maximum likelihood estimator for the problem, not the minimum 
quadratic distance estimator. The reason is that we anticipate that one 
would most likely use the quadratic distance fit assessment in conjunction 
with a maximum likelihood estimation procedure. 

To find the distribution theory for the likelihood-estimated distance, we 
build a score-centered kernel from K as follows. First, we construct the ex- 
tended score vector u* = (l,u^)"^. We then define the extended information 
matrix for a single observation to be 

rg = Eg[n*guf]. 

We will then let P* be the kernel operator defined by 
(4.2) Pg*{x,y) = n*gixf-rg-'-uUy). 
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The following formula shows that P* can be interpreted as the projection 
operator onto the extended space of likelihood scores: 

(4.3) J Po*{x, y)uf {y)dGe{y) = ufix). 

The score- centered kernel is defined to be 

= K{x,y)- f P;{x,z)K{z,y)dGeiz) 

(4.4) 



Kix,z)Pg*{z,y)dGeiz) 
+ / / Pe*{x,z)K{z,w)Pe*{w,y)dGe{z)dGe{w) 



The key feature of the score-centered kernel K^^en that it is Gg-orthogonal 
to the scores and the constant 1, as indicated next. 

Proposition 3. The score- centered kernel satisfies 
K',,,,ix,y)u*iy)dGe{y) = 0. 



This is easily proved using the definition (4.4) and repeated use of the 
projection property (4.3). 

Note that the scores u are themselves orthogonal to the constant, that 
is, / u{x) ■ ldGg{x) = 0; therefore we could also have constructed the score- 
centered kernel by replacing (4.4) with 

(4.5) (^I-Pg).K,,^^^^yiI-Pg), 

where Pg represents the projection onto the scores u instead of the extended 
scores u*. 

In the discrete case, we can represent Pg{i,j) by matrix = uq^q^uJ , 
where ug is the N x p matrix with entries dg .[log gg{i)]. We then get the 
matrix formula 

]KL, = (I-PeB,).<,.(I-D,Pe), 

where Bgi is diagonal with diagonal entries ge{i)- 

The empirical distance between the data and the estimated model is then 

d^(F,G^) = j I Kt,{x,y)dF{x)dF{y). 

This can be verified by using the fact / u^(x) dF{x) = for maximum like- 
lihood estimators. 

This then leads to our main result. 
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Theorem 4. Given the regularity assumptions itemized in the proofs, 
under Gq we have 



n 



dK{F,G^)- j I Ki^,ix,y)dF{x)dF{y) 



If Kg^^^ has a spectral decomposition Yl,i^i4'i{x)4>i{y) with finite trace, it 
follows that 

ndK{F,G^)^X*W, 
where Ai,A2,... are the eigenvalues of the spectral decomposition of K^^^^. 

In the Appendix we outline the steps in the proof. Notice that while there 
exists a corresponding U-statistic estimator of the distance, in the composite 
null hypothesis case it is no longer an unbiased estimator. One might still 
expect it to have slightly better operating characteristics. 

5. Spectral degrees of freedom. We have now presented an asymptotic 
theory for quadratic distance methods that looks complicated and difficult to 
use. Except for certain carefully designed kernels, the spectral decomposition 
will be dependent on the underlying true model. It is likely there is not an 
explicit solution to the eigenequations. Even if the decomposition is known, 
the limiting distribution itself will depend on infinitely many A parameters. 

These difficulties are not as severe as they first appear, because the key 
features of the spectral decomposition can be well summarized by the values 
of two scalar parameters called the Pearson scale factor and the spectral 
degrees of freedom. These two parameters are sufficient, in an asymptotic 
sense, for the description of the limiting distribution of the distance. As an 
additional bonus, they can easily be calculated for a model or estimated 
from the data without any spectral decomposition whatsoever. 

5.1. Pearson scaling and DOF. Quadratic distances have no inherent 
scale. That is, replacing the kernel K with K* = a ■ K, for an arbitrary 
constant a, creates a new distance that is equivalent to K for most mathe- 
matical and statistical purposes. 

Given a null measure G, we propose to rescale kernels so that they are 
as similar as possible to some standard kernel. The most natural standard 
kernel is the Pearson kernel. We will show that if we replace K with aK, 
where the scale factor is 



(5.1) a = aG{K) 



2 ' 



then the quadratic distance generated by aK is scaled to match the Pearson 
kernel. 



THEORY FOR QUADRATIC DISTANCES 



15 



Given there is a fixed measure of interest G, we define a distance between 
the two kernels Ki and K2 via 

traceG(i^i-if2)' = J j {Ki{x,y) - K2{x,y)f dG{x) dG{y). 

Define the scaUng factor a so that aK is as similar as possible to the Pearson 
kernel Q by minimizing the distance 

, 2\ tracecCQ - aK)'^ 

^ ' ' = traceG(Q^) - 2a{tva.ceG{QK)) + traceG(-fsr^). 

Suppose for a moment we are in the finite discrete case, so we can write the 
Pearson kernel as Q{x,y) = l[x = y]/ \/g{x)g{y). In this case we have 

traceG(QK) = j j Q{x,y)K{x,y) dG{x)dG{y) 
^^^'^\ {x)g{x)du{x) 

Putting this into the expansion (5.2), the minimizing a is (5.1). 

In other cases, if one minimizes the modified objective traceG(— 2ai^r(5 + 
a'^K'^), one again ends up with scale factor adK). 

5.2. Spectral degrees of freedom. We next define the spectral degrees of 
freedom (under G) of K to be 

(5.3) DOF„(;f)-'™'=°(^'''-<S^.)= 



traceG(A-2) E^! ' 

Note that DOF g{K) equals traceG(a-i^), where a is the scale factor adK). 
That is, the spectral degrees of freedom is just the sum of the eigenvalues of 
the rescaled kernel. 

Fan, Zhang and Zhang (2001) found that the limiting normal distribution 
of their goodness-of-fit statistics had the mean-variance relationship of a 
scaled chi-squared distribution, and they used this to define the degrees of 
freedom of these tests. This relationship will be discussed in Section 5.4. 

We will use a Satterthwaite approximation [Satterthwaite (1946)] to the 
X*(A) distribution to interpret DOF. Recall that under the null the empirical 
distance converges asymptotically in distribution to a linear combination 
of independent chi-squared random variables. Suppose we find scale a and 
degrees of freedom DOF so that 

E{adK{F,G)) = E{xloF), 
YaT{adK{F,G)) = Yav{xloF)- 
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Solving these two equations with respect to a and DOF, we obtain 

2E{dK{F,G)) 



a ■ 



Var(dx(F,G)) 
and 



(5.4) DOF : 



Var((ii^(F,G)) 



Using E{dK{F, G)) = tiaceaiK) = E Ai and \w{dK{F , G)) = 2 traceG(i^2) = 
2SAf, we obtain that a is the Pearson scale factor and DOF is the same 
as defined in (5.3). 

5.3. Two examples. In this subsection we will use two examples to il- 
lustrate calculation of the degrees of freedom. For point of comparison, we 
start with a classical quadratic distance, the Cramer-von Mises, which has 
a surprisingly small degrees of freedom. We then turn to the Poisson kernel 
as an example of the class of tuneable diffusion kernels. We show the degrees 
of freedom can be tuned to any value from 2 to infinity. 

5.3.1. Cramer-von Mises kernel. The Cramer-von Mises kernel is given 
as K(u,v) = 1 — max(u,?;) [Lindsay and Markatou (2002)]; its centered ver- 
sion is given as 

K^eniu, v) = l- max(n, v) - ((1 - u^)/2) - ((1 - v^)/2) + (1/3). 
Using G the uniform measure on (0, 1), we obtain 

trace (ii'cen ) = J Kcen{u,u)du = j {^ + u^ — u)du = ^ 

and 

trace(Kc^e„) = ^con {u,v)dudv = ^. 

Thus, the degrees of freedom for the centered Cramer-von Mises kernel are 

D0F=<1M; = 2.5. 
(1/90) 

5.3.2. The Poisson kernel. For the Poisson kernel (3.5), let the baseline 
measure be uniform on [0, 27r). Centering the kernel gives us 

K,,^{6,4>) = Kp{e,(t>)-l. 

The following proposition gives the degrees of freedom of the centered Pois- 
son kernel. 
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Proposition 4. The degrees of freedom of the centered Poisson kernel 
with respect to the uniform measure are given by 

1-p 

Proof. From (3.7) the eigenvalues of the centered Poisson kernel with 
respect to the uniform measure are given by the set of functions {p, p, p^, p^, . . .}. 
Now 

oo 

i=i 

and 

Therefore, the degrees of freedom are as given above. □ 

When p ^ the degrees of freedom converge to 2, corresponding to the 
test that depends only on the first two eigenfunctions, whereas when p^ 1 
the degrees of freedom diverge to infinity. 

5.4. Satterthwaite limit theory. We now explore the relationship between 
the X*W distribution, its Satterthwaite Xk approximation and its normal 
approximation. A central assumption of this analysis is that we are consid- 
ering kernels (like the normal or Poisson) with a tuning parameter r/ such 
that the degrees of freedom become infinite as 77 — > 0. 

The construction of X*W ^ sum of independent random variables 
suggests that we might hope for a central limit approximation for this dis- 
tribution under the condition that the degrees of freedom are sent to infinity. 
If so, normality would imply that just two parameters would be sufficient to 
describe the distribution. We here give a simple sufficient condition for this 
result, and then go further. We will show that under the same conditions the 
Satterthwaite Xdof approximation provides a two-parameter approximation 
that is always superior to the normal approximation. 

We start by standardizing to mean zero and variance 1: 

Xstd(A) = ^==—. 

Since X*W is a sum of independent variables, it is natural to study the 
cumulants of this distribution. Note that the cumulants of the standard 
normal, other than r = 2, are given by Kr{Z) = 0, whereas k.2{Z) = 1, so 
these are the cumulants we might hope to find in the limit. 



1-p 

l-p2- 
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To study the cumulants we define 




Then we have —'^ ^'^d — \/DOF(A). If we are considering the 

important special case of the Xr distribution, then R of the Aj are 1 and 
the rest are zero. This gives 7^ = for R values of i and else. 

The following lemma gives the cumulants for the standardized chi-star 
distribution. It arises from a straightforward calculation using the properties 
of cumulant-generating functions. Note that the cumulants of Xi given 
by K,(xf)=2--i(r-l)!. 

Lemma 5. For r >2 the rth cumulant of standardized X*{.X) is 
For the Xr distribution, an important special case, this gives f^riXustd) ~ 

/t,(X?)-2-^-/2.^1-r/2^ 

The degree of normality of the chi-star distribution can be measured by 
the departure of its cumulants from the normal values. In this case, we can 
show that the third cumulant (skewness) is the key factor. 

Lemma 6. The normed cumulants 

are decreasing in r for r = 2, 3, 4, 

Proof. Each 7j is bounded above by 1, so 7[ > ■ □ 

We have the following consequence of the last lemma: if we use a tuneable 
kernel with eigenvalues depending on tuning parameter rj, then all the 
cumulants of 3 and greater order converge to as r/ ^ if and only if the 
skewness cumulant K3(x*t(j(A^)) does. Indeed, if the skewness goes to zero, 
one can use the standard Taylor expansion proof to verify that 

x:td(A,)^iV(0,l) as7?^0. 

For the Xr distribution, the skewness cumulant is K3(x/j std) ~ ^^''^ " 
reflecting its known convergence to normality. Below we will show that the 
skewness for the Poisson kernel goes to zero at the same rate in R, where R 
is its spectral degrees of freedom. 
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If the zero-limit skewness property holds, one might ask whether there 
would sometimes be a preference for using the normal approximation over 
the Satterthwaite approximation to XstdC'^)- '^^^ answer is never, because 
the following lemma indicates that every cumulant of Xstd(-^) closer to the 
Satterthwaite chi-squared cumulant than it is to the normal. 

Lemma 7. Let R be a positive integer. For r > 3, and for any x*W 
distribution with DOF(A) = R, 

Proof. See Appendix. □ 

That is, the cumulants are always larger in magnitude than the chi- 
squared ones, and further from zero, the normal theory value. This result 
also suggests that the magnitude of the skewness ratio 

'^3(Xstd,R) 

could serve as a reasonable index of the relative chi-squaredness of the chi- 
star distribution when the degrees of freedom are large. Additionally, the 
limit of this ratio as R becomes infinite could serve as a single number 
summary. 

For the Poisson kernel example, with p = e^'^ and using the uniform base- 
line density, we get 

^3(x:td(A)) _ jl + e-^)' 
«^3(xL,i?) (l + e-^i + e-^^i)' 

a term which converges to 4/3 as 77 — > 0. That is (and we found this surpris- 
ing), with geometrically declining eigenvalues the skewness of standardized 
X*(A) lies closer to the chi-square's skewness than the latter's does to the 
normal value of 0. In general, for the Poisson kernel the ratio of rth cu- 
mulants converges to 2^~^/r, showing that the rth cumulant is the same 
magnitude as the chi-square: 0(i?i-^/2), where R is the de grees of freedom. 

6. Final comments. Quadratic distances with tuning parameters are in 
many ways like smooth chi-squared goodness-of-fit tests: the L2 relationship 
suggests that, as an alternative to constructing a finite set of bins, we are 
creating an infinite number and averaging across their deviations. The spec- 
tral degrees of freedom concept is meant to be a tool to help statisticians 
exploit this analogy. 
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If we accept this analogy, then the choice of the degrees of freedom in a 
problem is like the choice of the number of cells in the chi-squared test: it 
is clearly extremely important in determining the power of the test against 
interesting alternatives, but it is also very hard to devise hard and fast rules 
about its choice. That is because the nature of the interesting alternatives 
may not be clear to the user. We might add, that provided one is using 
the test statistics as an exploratory tool, there is no reason one would not 
consider a range of interesting degrees of freedom as a means of exploring 
the possible deviations from the model at various scales of smoothing. (We 
think that informal/exploratory model confirmation is widely used, and this 
would simply be another instance.) 

What then is an interesting range for degrees of freedom? At this time, 
we can only offer a heuristic analysis based on chi-squared tests. In a very 
general sense, increasing the number of cells in such a test, and therefore 
the degrees of freedom, will create a gain in sensitivity to deviations that 
are localized within a single small area (like a bump in the density) , but also 
create increased variability of the test statistic that causes it to lose power 
against more global alternatives that create a small shift in probability in 
many cells. 

In a chi-squared test one would want, even if searching for small local 
deviations, some minimum sample counts per cell in order to cut variability. 
If that minimum were 5, one would never have more than ri/5 degrees of 
freedom. This is a number we have used as a rough upper bound when we 
investigated a problem. 

On the other hand, just as a chi-squared test with two cells would be 
too coarse for most purposes, one should avoid having too small a degrees 
of freedom. In this regard, the dimension of the sample space is important. 
To illustrate, if one were to provide a one-degree-of-freedom test on each 
marginal distribution in a D-dimensional data set, then one has used D 
degrees of freedom. To also test all the bivariate marginals would take an 
additional (^) degrees of freedom. Based on this heuristic, we have used 
{^2^) as a very rough lower bound when investigating multivariate data 
sets. 

One important issue we have not touched upon in this paper is that of 
power. It is very difficult to draw broad conclusions about test procedures 
based on their power characteristics because the dimension of the alternative 
space is infinite, and it is inevitable that the identity of the best performing 
test will be highly dependent on the alternative that is chosen. Spitzner 
(2006) developed a detailed simulation study that compared a variety of 
testing strategies for combining quadratic tests into a single test statistic. 
Best power? The answer depended on the structure of the alternative; Fan's 
adaptive Neyman strategy [Fan (1996)] worked the best overall in Spitzner's 
particular simulation settings, but was not a universal winner. 
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APPENDIX: PROOFS AND LIMITING DISTRIBUTIONS 
A.l. Proofs for Theorem 3. 



Lemma A.l. For any w £ (0, 1), let a = a{w) = \J ^^^^ and b = b{w) 



i^. Then 



Kh^{x,y) 



2-Kh 



E 

i=0 



^;"(l-^;^)V2 



Hn{x;a,b) ■ Hn{y;a,b). 



Proof. Mehler's formula states that for w G (0, 1) 

Cn{w)Hn{x)Hn{y) = exp P^^"^ +y)w 

n=Q ^ 



where 



Cn{w) 



^«(l_y;2)l/2n 



2"n! 



A series of algebraic manipulations leads to the desired representation. □ 

The above formula is not a spectral representation unless we can choose 
a and b so that the Hn{x; a, b) terms are orthogonal under the normal mea- 
sure. The following gives us the necessary condition on a and b. 

Lemma A. 2. Let a and b be two positive scalars satisfying — b"^ = 
(2c7^)~^. The functions 70, • • • ,7n5 • • • defined by 



aa 
V2^ 



(A.l) jn{x;a,b) = Hn{x;a,b)^J 

are orthonormal under the measure M = N{0,a'^) . 
Proof. We start with the fundamental identity 

/ Hrn{x)Hn[x)e-''^ dx = l[m = n]2''n! Vvr. 
Jn 

With a change of variables x = ay, the left-hand side becomes 

LHS = af Hm{ay)Hn{ay)e-'''y' dy 
Jn 

= a f HUay)e-'''y'"H^{ay)e-''y''\-^'^'-''^y'dy 
Jn 

= a [ Km{y;a,b)Hn{y;a,b)e-y'/^'''dy. 
Jn 
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We therefore have 




Hm{y; a,b)Hn{y; a,b)e 




I[m = n] 




as needed. □ 

The final trick is to try to select the scalar w* in the first lemma such that 
the functions a{w*) and b{w*) that are defined there satisfy the condition 
a? — iP' = (2(T^)~^ of the second lemma. Let r = h? /a"^, the ratio of the kernel 
and baseline variances. 

Lemma A. 3. Set w*(r) = 1 — ^[\/4r + — r\. Then w*{r) decreases 
monotonely from 1 to as a function of r, for r G (0,oo). For any h'^ and 
(7^, we have a{w*{r))'^ — b{w*{r))'^ = (2(j^)~^. 

Proof. The function w*{r) is the left-hand root of the quadratic equa- 
tion rw = (1 — w)'^. Inspecting the plot of the two sides of this quadratic 
equation verifies the listed functional properties. The last equality is easy 
algebra. □ 

A.2. Proofs for Theorem 4. First, we show that score-centering implies 
mean-centering of the derivatives of the kernel. 

Proposition A.l. // the kernel is score- centered under Fj- = Gq, then 
under regularity conditions 



The proof can be easily obtained by differentiation under the integral sign. 
These mean-zero properties can then be used to show the following: 

Proposition A.2. Under regularity conditions found in the proof, 



(A.2) 




In addition. 



(A.3) 
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into the above expression. We then assume root-n consistency of the 
MLE, so that \/n{d — 9) converges in distribution. We assume that if X 
and Y are independent from Gg, the kernels VKsciX,Y) and \/'^Ksc{X,Y), 
which are mean zero from Proposition A.l, have finite variance, as do 
VK,ciX,X) and V^Ksc{X,X). If so, then nJjVKscix,y)dF{x)dF{y) and 
^'^Ksc{x, y) dF{x) dF{y) converge in distribution. This then assures that 
the first two terms are of stochastic order Op(n~^/^) and Op(n~^), respec- 
tively. The remainder term is then no larger than Op{n~^/'^) under the as- 
sumption that the elements of the arrays V^K^^ {^i^) a-nd V^K^^ {X,X) 
are bounded by finitely integrable functions for 9* in a neighborhood of 9. 
□ 

A.3. Proofs for Lemma 7. We start by proving the result when the 
eigenvalue sequence is finite in length, say 7i,...,7Ar. We can then write 
1 • aj„ , where ai , . . . , om represent the M distinct val- 
ues possible among the 7^ and the represent the counts for each am, 
divided by N . The expression ^T^mO-m is therefore the rth moment of the 
distribution that puts mass vTm, at support point am- For this distribution 
we know the first two moments: J^'^mO'm — VR/N and '^T^mf^'in = l/N. 
We wish to know the minimum possible value of the rth moment over the 
possible distributions represented by TTm and am- 

We enlarge the class of allowable distributions to include every distribu- 
tion P with its support in [0, 1] . Under the theory of moments, the solution 
to this optimization problem is an extremal distribution having index 3/2. 
That is, the optimal P has two support points, one of which is or 1. We 
can exclude 1 because this would maximize the rth moment, leaving us with 
one support point of 0. However, the Xstd R distribution has the eigenvalue 
distribution P of index 3/2, putting all its probability on the two support 
points and l/\/i? with masses {N — R)/N and R/N, respectively, and so 
it has the extremal rth moment. The theorem is concluded by taking limits 
as — > 00. 
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